Review of Independent Samples Comparisons (including ANOVA)
Simple Linear Regression Models
Fitting an OLS model
Performance of an OLS model
Checking the Fit
Transformations
What makes a “good” fit?
Load packages and set theme
library(ggrepel)        ## new: for building plots with text
library(glue)           ## new: for combining strings
library(googlesheets4)  ## new: importing Google Sheets data
library(naniar)         ## new: counting missingness
library(knitr)
library(kableExtra)     ## for neatening tables in slides
library(janitor)
library(car)            ## for boxCox function
library(infer)          ## bootstrapping
library(patchwork)
library(rstanarm)
library(easystats)
library(tidyverse)

theme_set(theme_bw())
knitr::opts_chunk$set(comment = NA)
source("c09/data/Love-431.R") # for the lovedist() function
Importing Data on Favorite Movies
gs4_deauth() # indicates to Google Drive that you're reading a public file
url <- "https://docs.google.com/spreadsheets/d/155iHDSUr8ZixX4nVcNq9HMkKbBizUhCsXHcteIdNddU"
mov1 <- read_sheet(url)
# A tibble: 13 × 2
superhero movie
<dbl> <chr>
1 1 Avengers: Infinity War
2 1 Avengers: Endgame
3 1 Big Hero 6
4 1 Black Panther
5 1 Captain Marvel
6 1 The Dark Knight
7 1 The Dark Knight Rises
8 1 Doctor Strange
9 1 Iron Man
10 1 Mystery Men
11 1 Spider Man: Into The Spider-Verse
12 1 Thor: Love and Thunder
13 1 V for Vendetta
# A tibble: 18 × 1
movie
<chr>
1 10 Things I Hate About You
2 About Time
3 Clueless
4 Coming To America
5 Crazy Rich Asians
6 Elf
7 Flipped
8 Harold and Maude
9 High School Musical 2
10 Jab We Met (When We Met)
11 La La Land
12 Legally Blonde
13 Life As We Know It
14 Life Is Beautiful
15 The Lobster
16 Mamma Mia!
17 Mamma Mia: Here We Go Again
18 Monte Carlo
# A tibble: 16 × 1
movie
<chr>
1 Murder Mystery
2 My Big Fat Greek Wedding
3 Notting Hill
4 Om Shanti Om (Peace Be With You)
5 The Parent Trap
6 Real Genius
7 Scott Pilgrim vs. The World
8 The Secret Life of Walter Mitty
9 Shrek
10 Shrek 2
11 Tangled
12 Thor: Love and Thunder
13 Trolls
14 When Harry Met Sally
15 Yeh Jawaani hai Deewani (Youth is Crazy)
16 Zindagi Na Milegi Dobara (You Only Live Once)
Thor: Love and Thunder does not have the “Romantic Comedy” genre. Only nine of our movies do (About Time, Clueless, Coming to America, Crazy Rich Asians, Harold and Maude, Legally Blonde, Mamma Mia!, My Big Fat Greek Wedding, and Notting Hill).
# A tibble: 14 × 1
movie
<chr>
1 2001: A Space Odyssey
2 About Time
3 Alien
4 Avatar
5 Back to the Future
6 Back to the Future Part II
7 Blade Runner 2049
8 Cloud Atlas
9 Despicable Me
10 Divergent
11 Eternal Sunshine of the Spotless Mind
12 Everything, Everywhere, All at Once
13 Face/Off
14 Gattaca
# A tibble: 20 × 1
movie
<chr>
1 Gravity
2 The Hunger Games
3 The Hunger Games: Catching Fire
4 Inception
5 Interstellar
6 Jurassic Park
7 The Lobster
8 The Matrix
9 Minions: The Rise of Gru
10 The Platform (El Hoyo)
11 Real Genius
12 Real Steel
13 Resident Evil
14 Rise of the Guardians
15 Sorry To Bother You
16 Star Wars Episode III: Revenge of the Sith
17 Star Wars: Episode IV: A New Hope
18 Star Wars: Episode V: The Empire Strikes Back
19 Star Wars: Episode VI: Return of the Jedi
20 War of the Worlds
Are these groups comparable?
Movie Questions for Today
Do movies released in 1942-2010 have more user ratings than movies released after 2010? (imdb_ratings, year)
How do movie lengths vary by MPA ratings? (length, mpa)
How strong is the association between how often a movie is rated on IMDB and its number of stars? (imdb_ratings, imdb_stars)
Question 1
Do movies released in 1942-2010 have more user ratings than movies released after 2010? (imdb_ratings, year)
Use our OLS model for sqrt(imdb_ratings) to make predictions on the square root scale.
estimate_means(fit1, ci = 0.89, by = "release", transform = "none") |>
  kbl(digits = 4)
release    Mean     SE      CI_low   CI_high
Older      675.7814 28.4476 630.1371 721.4257
After2010  623.0566 40.2310 558.5058 687.6074
Note that \(675.7814^2 \approx 456681\) and \(623.0566^2 \approx 388200\).
Making Predictions (2/2)
Use our model for sqrt(imdb_ratings) to make predictions on the original scale of imdb_ratings.
estimate_means(fit1, ci = 0.89, by = "release", transform = "response") |>
  kbl(digits = 0)
release    Mean    SE     CI_low  CI_high
Older      456681  38449  397073  520455
After2010  388200  50132  311929  472804
Summary for Question 1
Do movies released in 1942-2010 have more user ratings than movies released after 2010?
Group       Sample Mean  Sample Median
Older         592,217.1        367,000
After2010     482,915.6        369,500
Difference    109,301.5         -2,500
Bootstrap means: diff = 109302, 89% CI (-7165, 219881)
Bootstrap medians: diff = -2500, 89% CI (-124943, 122500)
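The deck builds these intervals with the infer package; the underlying idea can be sketched in base R. This is a hypothetical illustration on simulated groups (the group sizes, means, and SDs here are made up, not the movies data): resample each group with replacement, recompute the difference in means, and take percentiles of the resampled differences for an 89% interval.

```r
## Hypothetical sketch: percentile bootstrap 89% CI for a difference
## in means, on simulated data (not the movies data or infer workflow)
set.seed(431)
g1 <- rnorm(50, mean = 100, sd = 20)   # stand-in for the "Older" group
g2 <- rnorm(40, mean = 90,  sd = 20)   # stand-in for "After2010"

boot_diffs <- replicate(2000, {
  mean(sample(g1, replace = TRUE)) - mean(sample(g2, replace = TRUE))
})

mean(g1) - mean(g2)                    # observed difference in means
quantile(boot_diffs, c(0.055, 0.945))  # percentile 89% interval
```

The 0.055 and 0.945 quantiles leave 5.5% in each tail, matching the deck's 89% confidence level.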
Question 1 Models
Do movies released in 1942-2010 have more user ratings than movies released after 2010?
OLS: Square root of user ratings for a movie released in 1942-2010 is, on average, 52.72 (89% CI: -26, 132) higher than for a movie released after 2010.
Bayes: Square root of user ratings for a movie released in 1942-2010 is, on average, 54.06 (89% CI: -25, 131) higher than for a movie released after 2010.
Question 2
How do movie lengths vary by MPA ratings? (length, mpa)
# A tibble: 3 × 10
Level1 Level2 Difference CI_low CI_high SE df t p contr
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 (PG-13) PG 16.7 8.30 25.1 4.00 200 4.18 0.000132 (PG-13) -…
2 (PG-13) R 0.491 -7.75 8.74 3.92 200 0.125 0.900 (PG-13) -…
3 R PG 16.2 7.61 24.8 4.10 200 3.96 0.000206 R - PG
Plot Holm comparisons
ggplot(con_holm_tib, aes(x = contr, y = Difference)) +
  geom_point(size = 3, col = "purple") +
  geom_errorbar(aes(ymin = CI_low, ymax = CI_high)) +
  geom_hline(yintercept = 0, col = "red", lty = "dashed") +
  labs(title = "Holm 89% Intervals for Movie Length",
       x = "Contrast", y = "Difference in Length")
Summary for Question 2
How do movie lengths vary by MPA ratings?
MPA     n   Sample Mean  Sample Median
PG-13   74        128.4          123
R       67        128.0          122
PG      62        111.7          106.5
In the sample, PG movies are shorter by 16-17 minutes.
Pairwise 89% contrasts show larger differences between PG and the other MPA groups than between PG-13 and R.
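As a generic illustration of Holm-adjusted pairwise comparisons (using the built-in mtcars data rather than the movies, with cyl standing in for the mpa factor), base R's pairwise.t.test() applies the same Holm correction:

```r
## Sketch: Holm-adjusted pairwise t tests across three groups,
## using built-in mtcars data (cyl stands in for the mpa factor)
res <- pairwise.t.test(mtcars$mpg, factor(mtcars$cyl),
                       p.adjust.method = "holm", pool.sd = TRUE)
res$p.value  # 2 x 2 matrix of Holm-adjusted p-values
```

The Holm method adjusts the smallest p-value most aggressively, controlling the family-wise error rate across all three contrasts without assuming independence.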
Simple Linear Regression
Question 3
How strong is the association between how often a movie is rated on IMDB and its number of stars? (imdb_ratings, imdb_stars)
We’ll look at stars (on the y axis) vs. ratings (in 100,000s, on x).
ggplot(mov2, aes(x = imdb_ratings/100000, y = imdb_stars)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "Hundreds of Thousands of IMDB ratings",
       y = "Weighted average star rating")
Scatterplot of 228 movies
Pearson Correlation
cor(mov2$imdb_stars, mov2$imdb_ratings)
[1] 0.6131559
cor(mov2$imdb_stars, (mov2$imdb_ratings/100000))
[1] 0.6131559
We can add or multiply by a constant without changing the Pearson correlation coefficient.
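This invariance is easy to verify directly. A quick check using the built-in mtcars data (not the movies data): rescaling and shifting one variable by positive constants leaves Pearson's r unchanged.

```r
## Check: Pearson's r is unchanged by adding or multiplying by a
## (positive) constant -- illustrated with built-in mtcars data
r1 <- cor(mtcars$mpg, mtcars$wt)
r2 <- cor(mtcars$mpg, mtcars$wt / 1000 + 5)  # rescale and shift x
all.equal(r1, r2)  # TRUE
```

(Multiplying by a negative constant would flip the sign of r while keeping its magnitude.)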
cor_test(mov2, "imdb_stars", "imdb_ratings")
Parameter1 | Parameter2 | r | 95% CI | t(226) | p
--------------------------------------------------------------------
imdb_stars | imdb_ratings | 0.61 | [0.53, 0.69] | 11.67 | < .001***
Observations: 228
OLS model with imdb_ratings
fit5 <- lm(imdb_stars ~ imdb_ratings, data = mov2)
model_parameters(fit5, ci = 0.89)
Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a Wald t-distribution approximation.
When comparing any two movies whose # of IMDB user ratings are 100,000 apart, we see a star rating that is 0.10 stars (89% CI 0.08, 0.11) higher, on average, for the movie with more user ratings, according to this model.
Our model fit5 assumes that our errors (residuals) come from a Normal distribution with mean 0 and standard deviation sigma (\(\sigma\)) = 0.69.
Thus, about 68% of our predictions should be within \(\pm\) 0.69 stars of the correct outcome, and 95% of our predictions should be within \(\pm 2 \times 0.69\), or 1.38 stars.
Planned Break between Classes 09 and 10.
Performance Measures (1/5)
AIC: Akaike’s Information Criterion
AICc: Second-order (or small sample) AIC with a correction for small sample sizes
BIC: Bayesian Information Criterion
AIC, AICc and BIC are used when comparing one or more models for the same outcome. When comparing models fit using maximum likelihood (like OLS linear models), the smaller the AIC or BIC, the better the fit.
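A minimal sketch of this kind of comparison, using the built-in mtcars data rather than the movies: fit two OLS models for the same outcome and compare their AIC and BIC.

```r
## Sketch: AIC/BIC comparison of two OLS models for the same outcome
## (built-in mtcars data); smaller values indicate a better fit
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
AIC(m1, m2)  # returns a data frame with df and AIC for each model
BIC(m1, m2)
```

Here the two-predictor model m2 achieves the smaller AIC and BIC, so both criteria favor it despite its extra parameter.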
Performance Measures (2/5)
R2: r-squared value = 0.376
The R-squared (\(R^2\)) measure for an OLS fit describes how much of the variation in our outcome can be explained using our model (and its predictors). \(R^2\) falls between 0 and 1, and the closer it is to 1, the better the model fits our data.
In a simple (one-predictor) OLS model like this, the value is also the square of the Pearson correlation coefficient, \(r\).
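This identity is easy to confirm in R (shown here with the built-in mtcars data, not the movies):

```r
## Check: in a one-predictor OLS model, R^2 equals the squared
## Pearson correlation of outcome and predictor (mtcars data)
fit <- lm(mpg ~ wt, data = mtcars)
all.equal(summary(fit)$r.squared,
          cor(mtcars$mpg, mtcars$wt)^2)  # TRUE
```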
We called this \(R^2\) “eta-squared” (\(\eta^2\)) in ANOVA.
Performance Measures (3/5)
R2 (adj.): adjusted r-squared value = 0.373
Adjusted R-squared is an index (so it’s not a proportion of anything) for comparing different models (different predictor sets) for the same outcome.
The idea is to reduce the temptation to overfit the data, by penalizing the \(R^2\) value a little for each predictor.
Adjusted \(R^2\) is usually between 0 and 1, but can be negative.
Its formula accounts for the number of observations and the number of predictors in the model.
The adjusted \(R^2\) measure can never be larger than \(R^2\).
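The usual formula is \(R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-k-1}\), where \(n\) is the number of observations and \(k\) the number of predictors. A quick check against lm's own output, using the built-in mtcars data (not the movies):

```r
## Sketch: adjusted R^2 from its formula,
## 1 - (1 - R^2) * (n - 1) / (n - k - 1), checked against lm
fit <- lm(mpg ~ wt + hp, data = mtcars)   # k = 2 predictors
n <- nrow(mtcars); k <- 2
r2  <- summary(fit)$r.squared
adj <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
all.equal(adj, summary(fit)$adj.r.squared)  # TRUE
```

The \((n-1)/(n-k-1)\) factor is what penalizes each added predictor, and it is why adjusted \(R^2\) can never exceed \(R^2\).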
Performance Measures (4/5)
RMSE = 0.687
The RMSE is the square root of the variance of the residuals and summarizes the difference between the observed data and the model’s predicted values.
It can be interpreted as the standard deviation of the unexplained variance, and has the same units as the outcome.
When comparing models using the same data for the same outcome (but, for instance, with different predictor sets), lower RMSE values indicate better model fit.
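The computation itself is one line (sketched here on the built-in mtcars data). Note that RMSE divides by \(n\), while sigma divides by the residual degrees of freedom, so the two are close but not identical:

```r
## Sketch: RMSE as the square root of the mean squared residual
## (mtcars data); note the n denominator, vs. sigma's residual-df one
fit <- lm(mpg ~ wt, data = mtcars)
rmse <- sqrt(mean(residuals(fit)^2))
## relationship to sigma: rmse = sigma * sqrt(df_residual / n)
all.equal(rmse,
          summary(fit)$sigma * sqrt(fit$df.residual / nrow(mtcars)))  # TRUE
```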
Performance Measures (5/5)
Sigma = 0.690
Linear models assume that their residuals are drawn from a Normal distribution with mean 0 and standard deviation equal to sigma (\(\sigma\)).
This indicates that the predicted outcome will be within \(\pm 1 \sigma\) units of the observed outcome for approximately 68% of the data points, for example.
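A sketch of both pieces, using the built-in mtcars data rather than the movies: sigma computed from its definition (root of residual sum of squares over residual degrees of freedom), plus an empirical look at the "about 68% within one sigma" idea.

```r
## Sketch: sigma from an OLS fit, and an empirical check of the
## "within +/- 1 sigma about 68% of the time" idea (mtcars data)
fit <- lm(mpg ~ wt, data = mtcars)
s <- sqrt(sum(residuals(fit)^2) / df.residual(fit))
all.equal(s, summary(fit)$sigma)  # TRUE
mean(abs(residuals(fit)) <= s)    # near 0.68 if residuals are Normal
```

With only 32 observations the empirical fraction will only be roughly 68%; the point is that sigma gives the natural yardstick for prediction error on the outcome's own scale.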
Checking OLS Model Fit
Main assumptions of any simple linear regression are:
linearity: we assume that the outcome is linearly related to our predictor
constant variance (homoscedasticity): we assume that the variation of our outcome is about the same regardless of the value of our predictor
normal distribution: we assume that the errors around the regression model at any specified values of the x-variables follow an approximately Normal distribution.
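These three checks can also be sketched with base R's built-in diagnostics (a generic illustration on mtcars, not the deck's fit6 diagnostic plots):

```r
## Sketch: standard residual diagnostics for an OLS fit (mtcars data)
fit <- lm(mpg ~ wt, data = mtcars)
plot(fit, which = 1)  # residuals vs. fitted: linearity, constant variance
plot(fit, which = 2)  # Normal Q-Q plot of standardized residuals
```

A curved trend in the first plot flags non-linearity, a fan shape flags non-constant variance, and points drifting off the Q-Q line flag non-Normal errors.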
To check these assumptions, consider the following plots.
# A tibble: 5 × 2
mov_id movie
<chr> <chr>
1 M-041 The Dark Knight
2 M-032 Chinese Doctors (Zhong guo yi sheng)
3 M-069 The Gingerdead Man
4 M-130 Madea Goes To Jail
5 M-183 The Room
# A tibble: 3 × 3
mov_id movie imdb_stars
<chr> <chr> <dbl>
1 M-069 The Gingerdead Man 3.4
2 M-183 The Room 3.6
3 M-130 Madea Goes To Jail 4.6
Posterior Predictive Checks
fit6_diagnostic_plots[[1]]
Box-Cox suggestion?
boxCox(fit6)
imdb_stars squared?
p1 <- ggplot(mov2, aes(x = imdb_ratings/100000, y = imdb_stars)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "Hundreds of Thousands of IMDB ratings",
       y = "Weighted average star rating",
       title = "No transformation")

p2 <- ggplot(mov2, aes(x = imdb_ratings/100000, y = imdb_stars^2)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "Hundreds of Thousands of IMDB ratings",
       y = "Square of Weighted average star rating",
       title = "Outcome squared")

p1 + p2
Could we transform either variable?
p3 <- ggplot(mov2, aes(x = imdb_ratings/100000)) +
  geom_histogram(binwidth = 1, col = "white", fill = "dodgerblue") +
  stat_function(fun = function(x)
    dnorm(x, mean = mean(mov2$imdb_ratings/100000, na.rm = TRUE),
          sd = sd(mov2$imdb_ratings/100000, na.rm = TRUE)) *
      length(mov2$imdb_ratings/100000) * 1,
    geom = "area", alpha = 0.5, fill = "lightblue", col = "blue") +
  labs(x = "Hundreds of Thousands of IMDB ratings", y = "",
       title = "IMDB_Ratings / 100000")

p4 <- ggplot(mov2, aes(x = imdb_stars)) +
  geom_histogram(binwidth = 0.2, col = "black", fill = "gold") +
  stat_function(fun = function(x)
    dnorm(x, mean = mean(mov2$imdb_stars, na.rm = TRUE),
          sd = sd(mov2$imdb_stars, na.rm = TRUE)) *
      length(mov2$imdb_stars) * 0.2,
    geom = "area", alpha = 0.5, fill = "grey80", col = "black") +
  labs(x = "Weighted average star rating", y = "",
       title = "IMDB Stars")

p3 + p4
Transforming IMDB_ratings?
p3a <- ggplot(mov2, aes(x = log(imdb_ratings))) +
  geom_histogram(binwidth = 0.5, col = "white", fill = "dodgerblue") +
  stat_function(fun = function(x)
    dnorm(x, mean = mean(log(mov2$imdb_ratings), na.rm = TRUE),
          sd = sd(log(mov2$imdb_ratings), na.rm = TRUE)) *
      length(log(mov2$imdb_ratings)) * 0.5,
    geom = "area", alpha = 0.5, fill = "lightblue", col = "blue") +
  labs(x = "Log of IMDB ratings", y = "", title = "Log of IMDB_Ratings")

p3b <- ggplot(mov2, aes(x = sqrt(imdb_ratings))) +
  geom_histogram(binwidth = 100, col = "white", fill = "dodgerblue") +
  stat_function(fun = function(x)
    dnorm(x, mean = mean(sqrt(mov2$imdb_ratings), na.rm = TRUE),
          sd = sd(sqrt(mov2$imdb_ratings), na.rm = TRUE)) *
      length(sqrt(mov2$imdb_ratings)) * 100,
    geom = "area", alpha = 0.5, fill = "lightblue", col = "blue") +
  labs(x = "Square Root of IMDB ratings", y = "",
       title = "Square Root of IMDB_Ratings")

p3a + p3b
New Scatterplot?
p5 <- ggplot(mov2, aes(x = imdb_ratings, y = imdb_stars)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "IMDB ratings",
       y = "Weighted average star rating",
       title = "No transformation")

p6 <- ggplot(mov2, aes(x = sqrt(imdb_ratings), y = imdb_stars^2)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "Square Root of IMDB Ratings",
       y = "Weighted average star rating",
       title = "Square Root of Predictor")

p5 + p6
Model with \(\sqrt{\mbox{IMDB_ratings}}\)
fit7 <- lm(imdb_stars ~ sqrt(imdb_ratings), data = mov2)
model_parameters(fit7, ci = 0.89)
Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a Wald t-distribution approximation.
Suppose we have two movies whose square root of # of IMDB user ratings is 1000 apart. On average, the star rating is 1.60 stars (89% CI 1.40, 1.81) higher for the movie with more IMDB user ratings, according to model fit8.
Scatterplot for model fit8
ggplot(mov4, aes(x = sqrtratK, y = imdb_stars)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "(square root of IMDB ratings)/1000",
       y = "Weighted average star rating",
       title = "Transformation for `fit8` model")
Model fit8 vs. fit6 performance
fit6 <- lm(imdb_stars ~ users_100k, data = mov4)
fit8 <- lm(imdb_stars ~ sqrtratK, data = mov4)
model_performance(fit6)
ggplot(mov5, aes(x = imdb_ratings/100000, y = imdb_stars)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "IMDB ratings / 100000",
       y = "Weighted average star rating",
       title = glue(nrow(mov5), " Movies with over 1M IMDB Ratings"))
Is this a good idea?
Restricting the Sample???
Labeling the Movies in the Scatterplot
ggplot(mov5, aes(x = imdb_ratings/100000, y = imdb_stars, label = movie)) +
  geom_point() +
  geom_text_repel() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "IMDB ratings / 100000",
       y = "Weighted average star rating",
       title = glue(nrow(mov5), " Movies with over 1M IMDB Ratings"))
Our subset of 38 movies
p5a <- ggplot(mov5, aes(x = imdb_ratings/100000)) +
  geom_histogram(binwidth = 1, col = "white", fill = "dodgerblue") +
  stat_function(fun = function(x)
    dnorm(x, mean = mean(mov5$imdb_ratings/100000, na.rm = TRUE),
          sd = sd(mov5$imdb_ratings/100000, na.rm = TRUE)) *
      length(mov5$imdb_ratings/100000) * 1,
    geom = "area", alpha = 0.5, fill = "lightblue", col = "blue") +
  labs(x = "Hundreds of Thousands of IMDB ratings", y = "",
       title = "IMDB_Ratings / 100000")

p5b <- ggplot(mov5, aes(x = imdb_stars)) +
  geom_histogram(binwidth = 0.1, col = "black", fill = "gold") +
  stat_function(fun = function(x)
    dnorm(x, mean = mean(mov5$imdb_stars, na.rm = TRUE),
          sd = sd(mov5$imdb_stars, na.rm = TRUE)) *
      length(mov5$imdb_stars) * 0.1,
    geom = "area", alpha = 0.5, fill = "grey80", col = "black") +
  labs(x = "Weighted average star rating", y = "",
       title = "IMDB Stars")

p5a + p5b
Model fit9 for 38 movies
fit9 <- lm(imdb_stars ~ users_100k, data = mov5)
model_parameters(fit9, ci = 0.89)
Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
using a Wald t-distribution approximation.
If two movies each have over 1 million IMDB ratings, and the movies have a 100,000 user difference in IMDB ratings, then the movie with more ratings will, on average, have a star rating that is 0.05 stars higher (89% CI: 0.04, 0.07) than the movie with fewer ratings, according to model fit9.
If two movies with over 1 million IMDB ratings have a 100,000 user difference in IMDB ratings, then the movie with more ratings will, on average, have a star rating that is 0.05 stars higher (89% CI: 0.04, 0.06) than the movie with fewer ratings, according to fit10.
Note that we have some new summaries now. The \(R^2\), RMSE and Sigma values can be compared to fit9, which used the same data. On the whole, fit10 looks slightly worse than fit9 on these metrics.